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Abstract 


The R package RegressionFactory provides expander functions for constructing the 
high-dimensional gradient vector and Hessian matrix of the log-likelihood function for 
generalized linear models (GLMs), from the lower-dinrensional base-distribution deriva¬ 
tives. The software follows a modular implementation using the chain rule of derivatives. 

Such modularity offers a clear separation of case-specific components (base distribution 
functional form and link functions) from common steps (e.g., matrix algebra operations 
needed for expansion) in calculating log-likelihood derivatives. In doing so, RegressionFac¬ 
tory offers several advantages: 1) It provides a fast and convenient method for constructing 
log-likelihood and its derivatives by requiring only the low-dimensional, base-distribution 
derivatives, 2) The accompanying definiteness-invariance theorem allows researchers to 
reason about the negative-definiteness of the log-likelihood Hessian in the much lower¬ 
dimensional space of the base distributions, 3) The factorized, abstract view of regression 
suggests opportunities to generate novel regression models, and 4) Computational tech¬ 
niques for performance optimization can be developed generically in the abstract frame¬ 
work and be readily applicable across all the specific regression instances. We expect 
RegressionFactory to facilitate research and development on optimization and sampling 
techniques for GLM log-likelihoods as well as construction of composite models from GLM 
lego blocks, such as Hierarchical Bayesian models. 

Keywords: negative definiteness, regression, optimization, sampling, monte carlo rnarkov 

chain, hierarchical bayesian. 
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1. Introduction 

Generalized Linear Models (GLMs) (McCullagh and Nelder 1989) are one of the most widely- 
used classes of models in statistical analysis, and their properties have been thoroughly studied 
and documented (see, for example, Dunteman and Ho (2006)). Model training and predic¬ 
tion for GLMs often involves Maximum-Likelihood estimation (frequentist approaches) or 
posterior density estimation (Bayesian approaches), both of which require application of op¬ 
timization or MCMC sampling techniques to the log-likelihood function or some function 
containing it. Differentiable functions often benefit from optimization/sampling algorithms 
that utilize the first and/or second derivative of the function (Press 2007). With proper 
choice of link functions, many GLMs have log-likelihood functions that are not only twice- 
differentiable, but also globally-concave (Gilks and Wild 1992), making them ideal candidates 
for optimization/sampling routines that take advantage of these properties. For example, 
the most common optimization approach for GLMs is Iterative Reweighted Least Squares 
(IRLS) (Gentle 2007, Section 6.8.1). IRLS is a disguised form of Newton-Raphson optimiza¬ 
tion (Wright and Nocedal 1999), which uses both the gradient and Hessian of the function, 
and relies on global concavity for convergence. When Hessian is too expensive to calculate 
or lacks definiteness, other optimization techniques such as conjugate gradient (Press 2007, 
Section 10.6) can be used, which still require the first derivative of the function. Among 
MCMC sampling algorithms, Adaptive Rejection Sampler (Gilks and Wild 1992) uses the 
first derivative and requires concavity of the log-density. Stochastic Newton Sampler (Qi 
and Minka 2002; Mahani, Hasan, Jiang, and Sharabiani 2014), a Metropolis-Hastings sam¬ 
pler using a locally-fitted multivariate Gaussian, uses both first and second derivatives and 
also requires log-concavity. Other techniques such as Hamiltonian Monte Carlo (HMC) (Neal 
2011) use the first derivative of log-density, while their recent adaptations can use second and 
even third derivative information to adjust the mass matrix to local space geometry (Girolami 
and Calderhead 2011). Efficient implementation and analysis of GLM derivatives and their 
properties, therefore, is a key component to our ability to build probabilistic models using 
the powerful GLM framework. 

The R package RegressionFactory contributes to computational research and development 
on GLM-based statistical models by providing an abstract framework for constructing, and 
reasoning about, GLM-like log-likelihood functions and their derivatives. Its modular imple¬ 
mentation can be viewed as code factorization using the chain rule of derivatives (Apostol 
1974). It offers a clear separation of generic steps (expander functions) from model-specific 
steps (base functions). New regression models can be readily implemented by supplying their 
base function implementation. Since base functions are in the much lower-dimensional space 
of the underlying probability distribution (often a member of the exponential family with 
one or two parameters), implementation of their derivatives is much easier than doing so in 
the high-dinrensional space of regression coefficients. A by-product of this code refactoring 
using the chain rule is an invariance theorem governing the negative definiteness of the log- 
likelihood Hessian. The theorem allows this property to be studied in the base-distribution 
space, again a much easier task than doing so in the high-dimensional coefficient space. The 
modular organization of RegressionFactory also allows for performance optimization tech¬ 
niques to be made available across a broad set of regression models. This is particularly true 
for optimizations applied to expander functions, but also applies to base functions since they 
share many concepts and operations across models. RegressionFactory contains a lower-level 
set of tools compared to the facilities provided by mainstream regression utilities such as 
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the glm command in R, or the package dglm (Dunn and Smyth 2014) for building double 
(varying-dispersion) GLM models. Therefore, in addition to supporting research on opti¬ 
mization/sampling algorithms for GLMs as well as research on performance optimization for 
GLM derivative-calculation routines, exposing the log-likelihood derivatives using the mod¬ 
ular framework of RegressionFactory allows modelers to construct composite models from 
GLM lego blocks, including Hierarchical Bayesian models (Gelrnan and Hill 2006). 

The rest of the paper is organized as follows. In Section 2, we begin with an overview of GLM 
models and arrive at our abstract, and expanded, representation of GLM log-likelihoods (2.1). 
We then apply the chain rule of derviatives to this abstract expression to derive two equiv¬ 
alent sets of factorized equations (compact and explicit forms) for computing log-likelihood 
gradient and Hessian using their base-function counterparts (2.2). We use the explicit forms 
of the equations to prove a negative-definiteness invariance theorem for the log-likelihood 
Hessian (2.3). Section 3 discusses the implementation of the aforementioned factorized code 
in RegressionFactory using the expander functions (3.1) and the base functions (3.2). In 
Section 4, we illustrate the use of RegressionFactory using examples from single-parameter 
and multi-parameter base functions. Finally, Section 5 contains a summary and discussion. 


2. Theory 

In this section we develop the theoretical foundation for RegressionFactory, beginning with 
an overview of GLM models. 

2.1. Overview of GLMs 

In GLMs, response variable 1 y is assumed to be generated from an exponential-family dis¬ 
tribution, and its expected value is related to linear predictor x*/3 via the link function g: 

9(E(y)) = x f /3. (1) 

where x is the vector of covariates and (3 is the vector of coefficients. For single-parameter 
distributions, there is often a simple relationship between the distribution parameter and its 
mean. Combined with Equation 1, this is sufficient to define the distribution in terms of 
the linear predictor, x*/3. For many double-parameter distributions, the distribution can be 
expressed as 

My;fl,$) = exp{ y6> +c(y,$)} ( 2 ) 

where range of y does not depend on 0 or <1. This function can be maximized with respect 
to 6 without knowledge of 4>. Same is true if we have multiple conditionally-independent 
data points, where log-likelihood takes a summative form. Once 9 is found, we can find 
(dispersion parameter) through maximization or method of moments, as done by glm in R. 
Generalization to varying-dispersion models is offered in the R package dglm, where both 
mean and dispersion are assumed to be linear functions of covariates. In dglm estimation 
is done iteratively by alternating between an ordinary GLM and a dual GLM in which the 
deviance components of the ordinary GLM appear as responses (Smyth 1989). 

lr To simplify notation, we assume that response variable is scalar, but generalization to vector response 
variables is straightforward. 
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In RegressionFactory, we take a more general approach to GLMs that encompasses the glm 
and dglm approaches but is more flexible. Our basic assumption is that log-density for each 
data point can be written as: 

log (P(?/1 { = /(< x 1 ,^ 1 X J ,(3 J >,y) (3) 

where < a, b > means inner product of vectors a and b. Note that we have absorbed the 
nonlinearities introduced through one or more link functions into the definition of /. For N 
conditionally-independent observations yi, ■ ■ ■ ,vn, the log-likelihood as a function of coeffi¬ 
cients /3 J is given by: 


N 

HP 1 , ... ,/3 J ) = ^2fn(< x^/3 1 >), (4) 

n= 1 

where we have absorbed the dependence of each term on y n into the indexes of the base 
functions f n {u l ,..., u J ). With proper choice of nonlinear transformations, we can assume 
that the domain of L is KJ , where A" 7 is the dimensionality of /3 7 . 

This view of GLMs naturally unites single-parameter GLMs such as Binomial (with fixed 
number of trials) and Poisson, constant-dispersion two-parameter GLMs (e.g. normal and 
Gamma), varying-dispersion two-parameter GLMs (e.g. heteroscedastic normal regression), 
and multi-parameter models such as multinomial logit. It can motivate new GLM models 
such as geometric (see Section 4.3) and exponential, and can even include survival models 
(see, e.g., BSGW package (Mahani and Sharabiani 2014)). Several examples are discussed in 
Section 4. 

Our next step is to apply the chain rule of derivatives to Equation 4 to express the high¬ 
dimensional (yT A' 7 ) derivatives of L in terms of the low-dimensional (J) derivatives of f n ’ s. 
We will see that the resulting expressions offer a natural way for modular implementation of 
GLM derivatives. 


2.2. Application of chain rule 

First, we define our notation for representing derivative objects. We concatenate all J coeffi¬ 
cient vectors, /3 7 ’s, into a single )TL A' 7 -dimensional vector, /3: 

P = (P 1 ’*, • • ■ ,P J ’ t ) t ■ (5) 


The first derivative of log-likelihood can be written as: 


where 


For second derivatives we have: 


G( «= 1 = 

(( dL Y 

( dL YY 
"■ ,[ op J> ’ ’ 

(6) 

{ dL '.)*- 

dL 

dL 

'W K i 

(7) 

K dpj’ 



h (£)=s 


d 2 L 


dP 2 [WW 1 \ ,y , / 


( 8 ) 
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where we have defined H(/3) in terms of J 2 matrix blocks: 


d 2 L 


dpdp 3 


dL 


(9) 


j I.A':/ I./\ 


.W 

Applying the chain rule to the log-likelihood function of Equation 4, we derive expressions for 
its first and second derivatives as a function of the derivatives of the base functions f \, • ■ •, Jn- 


with 


and 


dL_ = y, dfn = dfn^j = x ,j j 
df3 j ^ df3 j ^ dvj n S ’ 

1 n =1 1 n =1 

, = ,dfi df N t 
8 duj ’ ' 


= (xi .x 


J \t 


*■!>•••> A N> 


Similarly, for the second derivative we have: 


d 2 L 


N 


d 2 fn 


N 


0 2 fn 


E u Jn _ ^ jn / j 

d(3 j d(3 j ' ~ ^ dvJdvJ' 


ixi') = X^h^'X-?'. 


( 10 ) 

( 11 ) 

( 12 ) 

(13) 

(14) 


where is a diagonal matrix of size N with n’th diagonal element defined as: 

h jf __ ® 2 fn 
n ~ duJduJ 

We refer to the matrix form of the Equations 10 and 13 as ‘compact’ forms, and the explicit- 
sum forms as ‘explicit’ forms. The expander functions in RegressionFactory use the compact 
form to implement the high-dimensional gradient and Hessian (see Section 3.1), while the 
definiteness-invariance theorem below utilizes the explicit-sum form of Equation 13. 


2.3. Definiteness invariance of Hessian 

Theorem 1. If all f n ’s in Equation 4 have negative definite Hessians AND if at least one 
of J matrices X J = (xj, ..., x^)* is full rank, then L(f3 1 , ..., /3 J ) also has a negative-definite 
Hessian. 


Proof. To prove negative-definiteness of H(/3) (hereafter referred to as H for brevity), we seek 
to prove that p*Hp is negative for all non-zero p in K1 . We begin by decomposing p 
into J subvectors of length K ] each: 



p = (p 1 > t ,...,p J - t ) t . 

(15) 

We now have: 



P Hp 

= v &2L , 

(16) 


- T.P”'(t, 8uldvJ ' •(<«<))P 1 

(17) 


n 3,3' 

(18) 
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If we define a set of new vectors q n as: 


P J,t *n] , 


q n = [p^x* • • 

and use h n to denote the J-by-J Hessian of f n : 

h n = [h{ j 


( 19 ) 


( 20 ) 


we can write: 

P <H P = 22 q n h « Qn- (21) 

n 

Since all h ra ’s are assumed to be negative definite, all q^hn qn terms must be non-positive. 
Therefore, p*Hp can be non-negative only if all its terms are zero, which is possible only if all 
q n ’s are zero vectors. This, in turn, means we must have p J: *x), = 0, V n,j . In other words, 
we must have X-^p- 7 = 0, V j. This means that all X J, s have non-singleton nullspaces and 
therefore cannot be full-rank, which contradicts our assumption. Therefore, p T Hp must be 
negative. This proves that H is negative definite. □ 


Proving negative-definiteness in the low-dimensional space of base functions is often much 
easier. For single-parameter distributions, we simply have to prove that the second derivative 
is negative. For two-parameter distributions, and according to Silvester’s criterion (Gilbert 
1991), it is sufficient to show that both diagonal elements of the base-distribution Hessian 
as well as its determinant are negative. Note that negative-definiteness depends not only on 
the distribution but also on the choice of link function(s). For twice-differentiable functions, 
negative-definiteness of Hessian and log-concavity are equivalent (Boyd and Vandenberghe 
2009). Gilks and Wild (1992) have a list of log-concave distributions and link functions. 


3. Implementation 

RegressionFactory is a direct implementation of compact expressions in Equations 10 and 13. 
These expressions imply a code refactoring by separating model-specific steps (calculation of 
g 3 and h 33 ) from generic steps (calculation of linear predictors X- 7 f3 3 as well as ~X. 3,t g 3 and 
~K 3 ' t h 33 X- 7 ). This decomposition is captured diagramatically in the system flow diagram of 
Figure 1. 

3.1. Expander functions 

Current implementation of RegressionFactory contains expander and base functions for one- 
parameter and two-parameter distributions. This covers the majority of interesting GLM 
cases, and a few more. A notable exception is multinomial regression models (such as logit 
and probit) which can have an unspecified number of slots. The package can be extended in 
the future to accommodate such more general cases. 

Single-parameter expander function 

Below is the source code for regfac. expand, lpar: 
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Figure 1: System flow diagram for RegressionFactory. The expander function is responsible 
for calculation of log-likelihood and its gradient and Hessian in the high-dimensional space 
of regression coefficients. It does so by calculating the linear predictors and supplying them 
to the base function, which is responsible for calculation of log-likelihood and its gradient 
and Hessian for each data point, in the low-dimensional space of the underlying probability 
distribution. The expander function converts these low-dimensional objects into the high- 
dimensional forms, using generic matrix-algebra operations. 
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R> regfac.expand.lpar <- function (beta, X, y, fbasel, fgh = 2, ...) { 

+ # obtain base distribution derivatives 

+ ret <- fbasel(X /*/ beta, y, fgh, ...) 

+ # expand base derivatives 

+ f <- sum(ret$f) 

+ if (fgh == 0) return (f) 

+ g <- t(X) /*/ ret$g 

+ if (fgh == 1) return (list(f = f, g = g)) 

+ xtw <- 0*X 

+ for (k in 1 :ncol (X)) xtw[, k] <- X[, k] * ret$h 
+ h <- t(xtw) /*/ X 

+ return (list(f = f, g = g, h = h)) 

+ } 

beta is the vector of coefficients, X is the matrix of covariates, y is the vector (or matrix) of 
response variable, fbasel is the single-parameter base function being expanded, and fgh is a 
flag indicating whether the gradient or Hessian must be returned or not. The dots argument 
(...) is used for passing special, fixed arguments such as the number of trials in a binomial 
regression. The vectorized function fbase is expected to return a list of three vectors: f, g 
and h, corresponding to the base distribution, its first derivative and its second derivative (all 
vectors of length N or nrow(X)). The second and third elements correspond to g 1 and h 11 in 
our notation. Several design aspects of the code are noteworthy for computational efficiency: 

1. Since h is diagonal, we only need to return the N diagonal elements. 

2. For the same reason, rather than multiplying h by X, we only multiply the vector of 
diagonal elements by each of the K columns of X. 

3. The flag fgh controls whether a) only the function value must be returned (fgh==0), b) 
only the function and its first derivative must be returned (f gh==l), or c) the function as 
well as its first and second derivative must be returned (fgh==2). This allows optimiza¬ 
tion or sampling algorithms that do not the first or second derivative to avoid paying 
an unnecessary computational penalty. Since most often a higher-level derivative im¬ 
plies the need for lower-level derivative(s) (including the function as zero’th derivative), 
and also since the computational cost of higher derivatives is much higher, the fgh flag 
works in an incremental fashion (only 3 options) rather than covering all permutations 
of f ,g,h. 

Two-parameter expander function 

Below is the source code for regfac.expand.2par, the 2D expander function in Regression- 
Factory: 

R> regfac.expand.2par <- function(coeff, X 
+ , Z=matrix(l.0, nrow = nrow(X), ncol = 1) 

+ , y, fbase2, fgh = 2, block.diag = FALSE 

+ ,...){ 

+ # extracting coefficients of X and Z 
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+ K1 <- ncol(X); K2 <- ncol(Z) 

+ beta <- coeff[1:K1] 

+ gamma <- coeff[K1 + 1:K2] 

+ 

+ # obtain base distribution derivatives 

+ ret <- fbase2(X /*/ beta, Z /*/ gamma, y, fgh, ...) 

+ 

+ # expand base derivatives 

+ # function 

+ f <- sum(ret$f) 

+ if (fgh == 0) return (f) 

+ # gradient 

+ g <- c(t(X) 7,*/ ret$g[, 1], t(Z) /*/ ret$g[, 2]) 

+ if (fgh == 1) return (list(f = f, g = g)) 

+ # Hessian 

+ h <- array(0, dim=c(Kl+K2, K1+K2)) 

+ # XX block 

+ xtw <- 0 * X 

+ for (k in 1:K1) xtw[, k] <- X[, k] * ret$h[, 1] 

+ h[1:K1, 1:K1J <- t(xtw) /*/ X 

+ # ZZ block 

+ ztw <- 0 * Z 

+ for (k in 1:K2) ztw[, kj <- Z[, k] * ret$h[, 2] 

+ h[Kl + 1 :K2, K1 + 1:K2] <- t(ztw) 7*7 Z 

+ # XZ and ZX blocks 

+ if ('.block.diag) { 

+ ztw2 <- 0*Z 

+ for (k in 1:K2) ztw2[,k] <- Z[,k]*ret$h[,3] 

+ h[Kl + 1 :K2, 1 :K1] <- t(ztw2)7*7X 

+ h [1 :K1, K1 + 1: K2] <- t(h[Kl + 1:K2, 1:K1]) 

+ } 

+ 

+ return (list(f = f, g = g, h = h)) 

+ } 

Aside from the same performance optimization techniques used for the one-parameter ex¬ 
pander function, the two-parameter expander function has an additional parameter, block. diag. 
When TRUE it sets the cross-derivative terms between the two slots to zero. It can be useful 
in two scenarios: 1) When the full Hessian is not negative definite, but the Hessian for each 
parameter is. Block-diagonalization allows for optimization and sampling techniques that 
rely on this property to be used, at the expense of potentially slower convergence since the 
block-diagonalized Hessian is not accurate, 2) When optimization of one slot can proceed 
without knowledge of the value of the other slot, as in many two-parameter exponential fam¬ 
ily members where the dispersion parameter can be ignored in ML estimation of the mean 
parameter (e.g. in normal distribution). 


3.2. Base distributions 
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Corresponding to the one-parameter and two-parameter expander functions, RegressionFac- 
tory offers many of the standard base distributions used in GLM models. Using the nomen¬ 
clature of glm, current version (0.7.1) contains the following base distributions and link 
functions (* indicates distributions not included in glm software): 

• One-parameter distributions: 

— Binomial (logit, probit, cauchit, cloglog) 

— Poisson (log) 

- Exponential (log) (*) 

— Geometric (logit) (*) 

• Two-parameter distributions: 

— Gaussian (identity / log) 

— Inverse Gaussian (log / log) 

— Gamma (log / log) 

A few points are worth mentioning regarding the choice of base distributions and link func¬ 
tions: 

1. Naming convention: We generally follow this convention for single-parameter distribu¬ 
tions: 

fbasel.<distribution>.<mean link function> 

and this convention for two-parameter distributions: 

fbase2.<distribution>.<mean link function>.<dispersion link function> 

There are can be exceptions. For example, in geometric regression 

fbasel.geometric.logit 

the linear predictor is assumed to be logit of the sucess probability, which is inverse 
of the distribution mean. Thus, technically the link function is -log(mu-l), but for 
brevity we simply refer to this link function as logit. Ultimately, naming conventions 
are less important than the definition of log-likelihood function, which combines the 
distribution and the link functions. 

2. Since the focus of RegressionFactory is on supporting optimization and sampling al¬ 
gorithms for GLM-like models, we are not interested in constant terms in the log- 
likelihood, i.e. terms that are independent of the regression coefficients. Therefore, we 
can omit them from the base functions for computational efficiency. An example is 
the log-factorial term in the Poisson base distribution. (Note that such constant terms 
are automatically differentiated out of the gradient and Hessian.) If needed, users can 
implement thin wrappers around the base functions to add the constant terms to the 
log-likelihood. 

3. Our preference is to choose link functions that map the natural domain of the distri¬ 
bution parameter to the real space. For example, in Poisson distribution the natural 
domain of the distribution mean is the positive real space. The log link function maps 
this natural domain to the entire real space. However, for identity and sqrt link 
functions the range is positive real space. 
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4. We also prefer link functions that produce negative-definiteness for the entire Hessian, 
or at least for Hessian blocks (corresponding to a subset of the base-distribution param¬ 
eters). This allow for more optimization/sampling algorithms that take advantage of 
concavity to be applied to the expanded log-likelihood (according to Theorem 1). 

5. We have chosen to absorb the link functions into the function names and their imple¬ 
mentation, rather than making distribution names and lonk functions parameters of a 
single base function. Doing the latter is certainly possible, offering usability at com¬ 
putational cost. Our current choice is driven by the fact that the primary target of 
RegressionFactory is developers rather than end-users. 


4. Using RegressionFactory 

The most basic application of RegressionFactory is to use the readily-available log-likelihood 
functions and derivatives. For example, one might be developing a Bayesian model where 
the log-likelihood is combined with the prior to form the posterior, which is then supplied 
to a sampling algorithm. Or one might be working on a new optimization algorithm and 
would like to test its correctness and performance on regression log-likelihood functions as an 
important use-case. Users can also supply their own base functions to the expander functions 
of RegressionFactory and readily obtain the log-likelihood and its derivatives. Implementation 
of functions for calculating base distribution derivatives is often quite simple, which can 
significantly reduce the time needed for prototyping a new regression model. 

There are two equivalent approaches for passing the log-likelihood functions to an optimiza¬ 
tion/sampling routine: 1) Pass the expander function as the primary function, and the base 
function as an argument of the primary function, 2) write a thin wrapper that combines the 
expander and base functions, and pass this wrapper function to the optimization/sampling 
routine. If the log-likelihood function must be added to another function (such as a prior), 
then the second approach is the only option where the wrapper implements the logic for 
adding the two functions. Due to its higher versatility as well as higher code readability, we 
recommend the second approach. 

The above point as well as other usage details are illustrated below, with several examples 
from single-parameter and double-parameter distributions. 

4.1. Example 1: Bayesian GLM 

The easiest way to take advantage of RegressionFactory is to utilize its standard GLM 
base functions in custom applications, either for testing the performance of a new optimiza¬ 
tion/sampling technique, or for composing more complex models from these lego blocks. In 
the first example, we show how a Bayesian GLM can be constructed in the RegressionFactory 
framework. 

We begin with a basic implementation of Bayesian logistic regression using flat normal priors 
on each coefficient. First we must load the package into our R session: 

R> library(RegressionFactory) 

Log-likelihood for logistic regression can be readily constructed by applying the single-parameter 
expander function to the binomial base function and setting the number of trials equal to 1: 
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R> loglike.logistic <- function (beta, X, y, fgh) { 

+ regfac.expand.lpar(beta, X, y, fbasel.binomial.logit, fgh, n=l) 

+ } 

We also need a prior for beta, which we assume to be a normal distribution on each of the K 
elements of beta with the same mean (mu.beta) and standard deviation (sd.beta): 

R> logprior.logistic <- function(beta, mu.beta, sd.beta, fgh) { 

+ f <- sum(dnorm(beta, mu.beta, sd.beta, log=TRUE)) 

+ if (fgh==0) return (f) 

+ g <- -(beta-mu.beta)/sd.beta~2 

+ if (fgh==l) return (list(f=f, g=g)) 

+ h <- diag(-l/sd.beta~2, nrow=length(beta)) 

+ return (list(f=f, g=g, h=h)) 

+ } 

We can now combine the likelihood and prior according to Bayes rule to construct the log- 
posterior: 

R> logpost.logistic <- function(beta, X, y, mu.beta, sd.beta, fgh) { 

+ ret.loglike <- loglike.logistic(beta, X, y, fgh) 

+ ret.logprior <- logprior.logistic(beta, mu.beta, sd.beta, fgh) 

+ regfac.merge(ret.loglike, ret.logprior, fgh=fgh) 

+ } 

In the above, we have taken advantage of the utility function regfac.merge for combining 
two lists containing function values and its first two derivatives. 

In order to test the above posterior function, we simulate some data using the generative 
model for logistic regression and estimate the coefficients using glm for reference: 

R> N <- 1000 
R> K <- 5 

R> X <- matrix (runif (N*K, min=-0.5, max=+0.5), ncol=K) 

R> beta <- runif(K, min=-0.5, max=+0.5) 

R> y <- rbinom(N, size = 1, prob = 1/(l+exp(-X°/ 0 *°/ 0 beta))) 

R> beta.glm <- glm(y~X-l, family=”binomial")$coefficients 

We now draw 1000 MCMC samples from the posterior of beta using Stochastic Newton 
Sampler (SNS), via R package sns (Mahani et al. 2014). We are taking advantage of the fact 
that the sum of two negative-definite Hessians is also negative-definite, a condition needed by 
SNS. Also, we assume that mu.beta and sd.beta are both given to provide a non-informative 
prior on beta. Finally, we run sns in non-stochastic mode via the flag rnd=FALSE to allow 
for better comparison of output with glm: 

R> library(sns) 

R> # for more accurate results and better comparison, increase nsmp 
R> nsmp <- 100 
R> mu.beta <- 0.0 
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R> sd.beta <- 1000 

R> beta.smp <- array(NA, dim=c(nsmp,K)) 

R> beta.tmp <- rep(0,K) 

R> for (n in l:nsmp) { 

+ beta.tmp <- sns(beta.tmp, fghEval=logpost.logistic, X=X, y=y 
+ , mu.beta=mu.beta, sd.beta=sd.beta, fgh=2, rnd=FALSE) 

+ beta.smp[n,] <- beta.tmp 
+ } 

R> beta.sns <- colMeans(beta.smp[(nsmp/2+1):nsmp,]) 

R> cbind(beta.glm, beta.sns) 


beta.glm beta.sns 


XI 0.01728161 0.01728161 
X2 -0.57629725 -0.57629722 
X3 -0.55204361 -0.55204359 
X4 0.23282848 0.23282846 
X5 -0.06941221 -0.06941221 


Next, we consider a less-trivial example. We create a hierarchical structure where coefficients 
of J groups are assumed to be pooled from normal distribution. This is a simple example of 
Hierarchical Bayesian models, which due to lack of explanatory variables at the upper level is 
reduced to a random-coefficient model. We begin with data generation to provide the reader 
with a tangible grasp of the assumed generative model: 

R> J <- 20 

R> mu.beta.hb <- runif(K, min=-0.5, max=+0.5) 

R> sd.beta.hb <- runif(K, min=0.5, max=1.0) 

R> X.hb <- list() 

R> y.hb <- list() 

R> beta.hb <- array(NA, dim=c(J,K)) 

R> for (k in 1:K) { 

+ beta.hb[,k] <- rnorm(J, mu.beta.hb[k], sd.beta.hb[k]) 

+ } 

R> for (j in 1:J) { 

+ X.hb[[j]] <- matrix(runif(N*K, min=-0.5, max=+0.5), ncol=K) 

+ y-hb[[j]] <- rbinom(N, size=l, prob=l/(l+exp(-X°/,*°/ 0 beta.hb[j,]))) 

+ } 

Again, we generate glm coefficient estimates for reference. Note that glm treats the groups 
completely independently of each other, i.e. without any pooling: 

R> beta.glm.all <- array(NA, dim=c(J,K)) 

R> for (j in 1:J) { 

+ beta.glm. all [j,] <- glm(y .hb[[j]] ~X.hb[[j]]-l 
+ , family="binomial")$coefficients 

+ } 
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Again, we draw samples from posterior on coefficients using SNS, turning the rnd flag off 
for better comparison. Also for code brevity and maintaining focus on how to use pkgRe- 
gressionFactory, we ignore sampling from the posterior of mu. beta and sd.beta, and assume 
their value is given. We must first construct the log-posteriors. Note that we do not need 
to change the definition of log-posteior, but the interpretation of mu.beta and sd.beta has 
changed fronrn scalars to vector of length K each: 

R> beta.smp.hb <- arra y(NA, dim=c(nsmp,J,K)) 

R> beta.tmp.hb <- array(0.0, dim=c(J,K)) 

R> for (n in 1:nsmp) { 

+ for (j in 1: J) { 

+ beta.tmp.hb[j,] <- sns(beta.tmp.hb[j,], fghEval=logpost.logistic 

+ , X=X.hb[[j]J , y=y.hb[[j]] 

+ , mu.beta=mu.beta.hb, sd.beta=sd.beta.hb, fgh=2, rnd=F) 

+ } 

+ beta.smp.hb[n,,] <- beta.tmp.hb 

+ } 

R> beta.sns.hb <- apply(beta.smp.hb[(nsmp/2+l):nsmp,,], c(2,3), mean) 

We have taken advantage of conditional independence [ref] of the coefficients of each group, 
given the values of mu.beta and sd.beta. We examine the coefficients of the first few groups 
between glm and HB methods: 

R> head(beta.glm.all) 

[,1] [,2] [,3] 

[1,] 0.02469046 -0.02179051 0.09795783 

[2,] 0.23716478 -0.14741078 -0.64943037 

[3,] -0.08602828 0.07878448 0.16795872 

[4,] 0.00178465 0.24283261 -0.24240170 

[5,] 0.17083027 0.32218840 0.13358916 

[6,] 0.22690633 0.11722254 -0.08157556 

R> head(beta.sns.hb) 

[,1] [,2] [,3] [,4] [,5] 

[lj -0.008028756 -0.03689594 0.13273096 -0.36892225 0.27940600 

[2,] 0.166836113 -0.15122282 -0.54964963 -0.14652433 0.10439417 

[3,] -0.103947649 0.05881464 0.19355822 0.05347266 -0.02005739 

[4,] -0.029730149 0.21519779 -0.17819428 -0.26395210 -0.18503945 

[5,] 0.122504007 0.29084126 0.15998503 0.06725949 -0.18384662 

[6,] 0.169507903 0.10188925 -0.03573334 0.26120405 -0.02598995 

Plotting unpooled (glm) and pooled (sns) coefficients shows the typical shrinkage pattern of 
Bayesian models. 

R> plot(beta.glm.all[,1], beta.sns.hb[, 1] 

+ , xlab="Unpooled Coefficients" 


[,4] 

-0.3686893 

-0.1177934 

0.1100804 

-0.2504084 

0.1168579 

0.3395405 


[, 5] 

0.37422102 

0.17251167 

0.03127816 

-0.15572659 

-0.15588975 

0.02823816 
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Figure 2: Pooling of logistic regression coefficients using a hierarchical Bayesian framework 
produces the familiar shrinkage towards the mean effect. 


+ , ylab="Pooled Coefficients") 

R> abline(a=0, b=l) 


4.2. Example 2: Double-parameter GLM with varying dispersion 

As a second example, we consider a double-parameter GLM with varying dispersion, i.e., 
dependent on the covariates. As of version 0.7.1, RegressionFactory contains three double¬ 
parameter base distributions: Gaussian, inverse Gaussian, and Gamma. These double¬ 
parameter distributions can be used in a constant-dispersion or varying-dispersion setting. 
Constant-dispersion scenario is a special case of varying-dispersion scenario where the only 
covariate used to explain the dispersion parameter of the base distribution is intercept. This 
corresponds to the default value of Z in the function regfac.expand.2par. 

First, we load the R package dglm: 


R> library(dglm) 
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To use RegressionFactory, as before we implement a thin wrapper to combine the 2D expander 
with the normal base distribution: 

R> loglike.linreg <- function(coeff, X, y, fgh, vd = F) { 

+ if (vd) regfac.expand.2par(coeff = coeff, X = X, Z = X, y = y 
+ , fbase2 = fbase2.gaussian.identity.log, fgh = fgh, block.diag = F) 

+ else regfac.expand.2par(coeff = coeff, X = X, y = y 

+ , fbase2 = fbase2.gaussian.identity.log, fgh = fgh, block.diag = F) 

+ } 

The boolean flag vd indicates whether we want to use covariates to explain the dispersion or 
not. If FALSE, the model is reduced to ordinary linear regression. Next, we simulate data 
according to the assumed generative model: 

R> N <- 1000 
R> K <- 5 

R> X <- matrix(runif(N*K, min=-0.5, max=+0.5), ncol=K) 

R> beta <- runif(K, min=-0.5, max=+0.5) 

R> gamma <- runif(K, min=-0.5, max=+0.5) 

R> mean.vec <- X°/*°/beta 
R> sd.vec <- exp(X%*%gamma) 

R> y <- rnorm(N, mean.vec, sd.vec) 

We now estimate constant-dispersion and varying-dispersion models using the R commands 
lm and dglm: 

R> # constant-dispersion model 
R> est.glm <- lm(y~X-l) 

R> beta.glm <- est.glm$coefficients 
R> sigma.glm <- summary(est.glm)$sigma 
R> # varying-dispersion model 

R> est.dglm <- dglm(y~X-l, dformula = ~X-1, family = "gaussian", dlink = "log") 
family: gaussian 

R> beta.dglm <- est.dglm$coefficients 

R> gamma.dglm <- est.dglm$dispersion.fit$coefficients 

Finally, we estimate the same models using the expander framework of RegressionFactory: 

R> # constant-dispersion 

R> coeff.smp <- array(NA, dim=c(nsmp, K+l)) 

R> coeff.tmp <- rep(0, K+l) 

R> for (n in l:nsmp) { 

+ coeff.tmp <- sns(coeff.tmp, fghEval=loglike.linreg 
+ , X=X, y=y, fgh=2, vd = F, rnd = F) 

+ coeff,smp[n,J <- coeff.tmp 
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+ } 

R> beta.sns.cd <- colMeans(coeff.smp[(nsmp/2+1):nsmp, 1:KJ) 

R> sigma.sns.cd <- sqrt(exp(mean(coeff.smp[(nsmp/2+1):nsmp, K+l]))) 
R> cbind(beta.glm, beta.sns.cd) 


beta.glm beta.sns.cd 


XI -0.2630246 
X2 -0.4988430 
X3 0.5966928 
X4 -0.3255354 
X5 0.1037526 


-0.2630246 

-0.4988430 

0.5966928 

-0.3255354 

0.1037526 


R> cbind(sigma.glm, sigma.sns.cd) 

sigma.glm sigma.sns.cd 
[1,] 1.031139 1.028558 


R> # varying-dispersion 

R> coeff.smp <- array(NA, dim=c(nsmp, 2*K)) 

R> coeff.tmp <- rep(0, 2*K) 

R> for (n in l:nsmp) { 

+ coeff.tmp <- sns(coeff.tmp, fghEval=loglike.linreg 

+ , X=X, y=y, fgh=2, vd = T, rnd = F) 

+ coeff.smpfn,] <- coeff.tmp 

+ } 

R> beta.sns.vd <- colMeans(coeff.smp[(nsmp/2+1):nsmp, l:Kj) 

R> gamma.sns.vd <- colMeans(coeff.smp [(nsmp/2+1):nsmp, K+1:K]) 
R> cbind(beta.dglm, beta.sns.vd) 


beta.dglm beta.sns.vd 


XI -0.2702744 
X2 -0.5500612 
X3 0.6352290 
X4 -0.3866443 
X5 0.1004077 


-0.2702757 

-0.5500678 

0.6352354 

-0.3866514 

0.1004075 


R> cbind(gamma.dglm, gamma.sns.vd) 


gamma.dglm gamma.sns.vd 


XI 0.3183887 
X2 0.5214325 
X3 0.2689971 
X4 0.6318339 
X5 -0.4145652 


0.3184326 

0.5214858 

0.2689735 

0.6318852 

-0.4145799 


Note that the mean coefficients from dglm and RegressionFactory match exactly in constant- 
dispersion case, but the dispersion parameters do not match since dglm uses a method of mo¬ 
ments to estimate dispersion, rather than log-likelihood maximization. For varying-dispersion 
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scenario, since mean and dispersion coefficients are estimated simultaneously, neither sets 
match exactly between the two methods, but they are very close, and the discrepancy be¬ 
comes smaller for larger data. 

4.3. Example 3: Geometric regression 

In the last example, we illustrate how a new GLM regression can be easily constructed using 
the RegressionFactory framework. This involves three steps: 1) identify a base distribution, 2) 
select the link function(s), and 3) combine 1 and 2 to arrive at the log-likelihood function and 
its derivatives, preferrably to make the Hessian negative-definite. According to Theorem 1, 
this property can be proven in the base-distribution space, which is often quite easy. Consider 
the geometric distribution: 

P(y = k;p) = ( 1- p) fc_1 p. (22) 

Using a logit link function for p, we arrive at the following log-likelihood: 

f(u- y) = - (y u + (1 + y) log (l + e~ u ) ) . (23) 

Concavity of the above function can be easily verified: 

fuu = ~ (1 + y)e u /(1 + e“) 2 < 0 (24) 

The base function fbasel .goemetric. logit implements the above log-likelihood and its first 
two derivatives. To test the function, we first simulate data from the distribution: 

R> N <- 1000 
R> K <- 5 

R> X <- matrix(runif(N*K, min=-0.5, max=+0.5), ncol=K) 

R> beta <- runif(K, min=-0.5, max=+0.5) 

R> y <- rgeom(N, prob = 1/(l+exp(-X°/ 0 *°/ 0 beta) ) ) 

We now use SNS in non-stochastic mode (i.e. Newton optimization) to estimate the coeffi¬ 
cients. We begin by our usual thin wrapper around the expander function to fully implement 
the log-likelihood. 

R> loglike.geometric <- function(beta, X, y, fgh) { 

+ regfac.expand.lpar(beta, X, y, fbasel.geometric.logit, fgh) 

+ } 

R> beta.est <- rep(0,K) 

R> for (n in 1:10) { 

+ beta.est <- sns(beta.est, fghEval=loglike.geometric 
+ , X=X, y=y, fgh=2, rnd = F) 

+ } 

R> cbind(beta, beta.est) 

beta beta.est 
[1,] 0.3631200 0.4260850 

[2,] 0.1219817 0.1695647 

[3,] -0.3779823 -0.4790984 
[4,] -0.4623841 -0.2790712 
[5,] -0.1732932 -0.3219395 
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5. Summary 

We presented R package RegressionFactory, a modular framework for evaluating GLM log- 
likelihood functions and their derivatives. We illustrated its utility in rapidly developing 
composite GLM models such as Hierarchical Bayesian as well as new regression models such 
as geometric and exponential regression. The accompanying definiteness-invariance theorem 
allows us to reason about logl-likelihood Hessian in a much lower-dimensional space. 

Another advantage of our modular implementation is that it allows for performance optimiza¬ 
tion strategies to be readily applied across all GLM models. For example, the linear algebra 
steps contained in the expansion functions regfac.expand, lpar and regfac.expand.2par 
can be thoroughly studied from the following perspectives: 

• Row-major vs. column-major layout of the covariate matrices X 7 ’s, for single-threaded 
and multi-threaded scenarios. 

• Non-Uniform Memory Access (NUMA) implications of memory allocation for X 7 ’s. 

• Loop and cache fusion strategies. 

• Coarse- vs. fine-grained parallelization in composite models such as HB. 

While base functions contain model-specific code, yet they also present broad optimization 
opportunities. For example, they are all vectorized by definition, suggesting that they can 
benefit from optimized Single-Instruction, Multiple-Data (SIMD) implementation. In par¬ 
ticular, access to vectorized transcendental functions can greatly improve the peformance of 
many base functions. Many of the above issues have been studied here (Mahani and Shara- 
biani 2013). A natural next step for RegressionFactory would be to implement the expander 
and base functions in compiled code such as C/C++, which allows for many of the advanced 
optimization techniques to be applicable subsequently. 
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